Setup

Run if you don’t have the packages installed.

# a helper abbreviation
`%not in%` <- Negate(`%in%`)

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

needed_packages <- 
  c("recount3", "maftools", "DESeq2", "TCGAbiolinks", "biomaRt")

for (pkg in needed_packages) {
  if (pkg %not in% rownames(installed.packages())) {
    print(paste("Trying to install", pkg))
    BiocManager::install(pkg)
    if ((pkg %not in% rownames(installed.packages()))) {
      msg <- paste("ERROR: Unsuccessful!", pkg, "not installed!",
                   "Check the log and try installing the package manually.")
      stop(msg)
    } 
  }
  library(pkg, character.only = TRUE)
  ifelse(pkg %in% loadedNamespaces(), 
         print(paste("Successful!", pkg, "loaded.")),
         print(paste("ERROR: Unsuccessful!", pkg, 
                     "not loaded. Check error msg.")))
}

pkg <- "InteractiveComplexHeatmap"
if (pkg %not in% installed.packages()) {
  print(paste("Trying to install", pkg))
  remotes::install_github("jokergoo/InteractiveComplexHeatmap", 
                          upgrade = "always")
  library(pkg, character.only = TRUE)
  ifelse(pkg %in% loadedNamespaces(), 
         print(paste("Successful!", pkg, "loaded.")),
         print(paste("ERROR: Unsuccessful!", pkg, 
                     "not loaded. Check error msg.")))
}

Load all the packages and define functions.

# for unified and processed RNA-seq data
library(recount3)
# to normalize the RNA-seq data 
library(DESeq2) 
# for access to TCGA data
library(TCGAbiolinks)
# to look at the data
library(tidyverse)
# to visualize the mutation data
library(maftools)
# to create heatmaps
library(ComplexHeatmap)

scale2 <- function(mat, ...) {
  t(scale(t(mat), ...))
}

Gene expression

Preparing the data

Using recount3 we download the data for a Lower Grade Glioma (LGG). In order to read more about the package and explore more of its function, refer to the manual and quick guide.

rse_gene <- 
  create_rse(
    subset(
      available_projects(),
      project == "LGG" & project_type == "data_sources"
    )
  )

Now, let’s explore what are we looking at?

assayNames(rse_gene)
## [1] "raw_counts"

We need to scale the reads to be able to use them in DESeq2 processing.

assay(rse_gene, "counts") <- 
  transform_counts(rse_gene)

The attached colData contains a lot of information that we can use on top of the expression data.

sample_sheet <-
  colData(rse_gene) %>%
  data.frame() %>%
  rownames_to_column("sample_id")

sample_sheet %>%
  head(n = 3)

For plotting, we will use the variance stabilizing transformation (vst) normalized counts (as it is quicker than rlog). To read more about normalization, please read Data transformations and visualization section of RNA-seq data anlysis guide from Bioconductor.

normalized_counts <- 
  vst(assays(rse_gene)$counts)
## converting counts to integer mode
normalized_counts[1:5, 1:2]
##                   f56466dc-3c0f-48be-8a72-cedeaef52f00
## ENSG00000278704.1                             2.648876
## ENSG00000277400.1                             2.648876
## ENSG00000274847.1                             2.648876
## ENSG00000277428.1                             2.648876
## ENSG00000276256.1                             2.648876
##                   f35f85a8-b3cd-4116-990f-94b2e432f902
## ENSG00000278704.1                             2.648876
## ENSG00000277400.1                             2.648876
## ENSG00000274847.1                             2.648876
## ENSG00000277428.1                             2.648876
## ENSG00000276256.1                             2.648876

Now, to simplify our analyses we want to have a) only Tumor Samples (unless you decide otherwise) and b) one sample per patient in the data set.

sample_sheet_one <-
  sample_sheet %>%
  filter(tcga.cgc_sample_sample_type == "Primary Tumor") %>%
  mutate(patient_id = str_extract(tcga.tcga_barcode, 
                                "[^-]{4}-[^-]{2}-[^-]{4}")) %>%
  group_by(patient_id) %>%
  # this is quick, but dirty way to take just one repeat per patient
  sample_n(1) 

# select only the samples we want to keep
normalized_counts <-
  normalized_counts[ ,  sample_sheet_one$sample_id]

# change rownames to nice patient ids
colnames(normalized_counts) <- sample_sheet_one$patient_id

normalized_counts[156:159, 1:2]
##                   TCGA-CS-4938 TCGA-CS-4941
## ENSG00000238203.8     2.648876     2.648876
## ENSG00000264393.1     2.648876     2.648876
## ENSG00000231116.9     2.648876     2.648876
## ENSG00000224979.6     2.648876     2.648876

Sample distances heatmap

Let’s calculate the variance between the genes, it will come in handy when we would like to narrow down the analyses to the most variable subset of genes.

row_var <-
  rowVars(normalized_counts)

Let’s look at the sample distances - can we see some clustering?

First, let’s calculate the distances between our samples. For the sake of speed we will only take top 35% of genes.

samples_dist <- dist(t(normalized_counts[row_var > quantile(row_var, 0.75),]))

And now, let’s see the sample distances. For that we will use heatmap. A heatmap is a visualization of a matrix in which each cell is colored according to its value.

Heatmap(as.matrix(samples_dist), 
        show_row_names = FALSE, 
        show_column_names = FALSE, 
        col = viridis::mako(100), 
        show_row_dend = FALSE)

We can see that there are some samples with high and low similarity, meaning that we can see some organization in our data.

Gene expression heatmap

Now, let’s generate the heatmap showing expression of different genes. For that purpose, we select top 1% of highly variable genes.

In here, the rows are genes and the columns are samples, meaning that each cell represents an expression value for a given gene in a given sample. In this heatmap, we visualized scaled values - meaning we subtracted the mean and dived by standard deviation. Scaling allows us to focus on the changes in expression between samples, regardless of absolute levels of expression of particular genes.

ht <-
  Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
        show_row_names = FALSE, show_column_names = FALSE,
        clustering_distance_rows = "pearson", name = "gene expression",
        col = viridis::viridis(100), use_raster = TRUE)

ht
## Loading required namespace: Cairo

Those heatmap need some work - we need to add more information coming from additional data sources. But first, let see what type of data we have access to.

Somatic mutations

We will now use the great TCGAbiolinks package that help accessing the vast data source of TCGA. Read more about this great package here.

Because of the unexpected maintenance of the GDC server we will have to use other means to download the mutations data. For that we will use the maftools::tcgaLoad function. Note that that function comes with the v2.8 version, the code below should tell you if you need to install the package.

Note2: This maf from maftools::tcgaLoad is generated based on the hg19 genome, while the TCGAbiolinks::GDCquery_Maf you could choose hg38 or hg19 (on default it downloaded the hg38).

maf <-
  GDCquery_Maf("LGG", pipelines = "mutect2") %>%
  read.maf(verbose = TRUE)

# now that GDC works again, we don't need this
# 
# tryCatch(maf <- tcgaLoad(study = "LGG"), 
#          error = function(e) {
#            print(paste(rep("#", 50), collapse = ""))
#            print(paste0("# ERROR! Read the message below!", 
#                         paste(rep(" ", 17), collapse = ""),
#                         "#"))
#            print(paste(rep("#", 50), collapse = ""))
#            print(e)
#            print(paste("If you're seeing this message you probably don't have",
#                        "maftools package loaded, or have an older version.", 
#                        "This function is available with v2.8.",
#                        "Install the new version of maftools package with",
#                        "`BiocManager::install('PoisonAlien/maftools')`", 
#                        "and try again!"))
#            })

First thing we can do is the mutation specific summary - what kind of mutations do we have in a sample? For that we will use functions from another great package - maftools a dedicated package to visualize maf data. You can see what brilliant things the package does and what things you can easily investigate here.

plotmafSummary(maf = maf, rmOutlier = TRUE, 
               addStat = 'median', dashboard = TRUE, 
               log_scale = TRUE)

First thing we see is that this cancer type does not have many mutations. We can compare it with other TCGA cancer types.

tcga_mutation_burden <- 
  tcgaCompare(maf = maf, cohortName = "LGG")
## Warning in FUN(X[[i]], ...): Removed 1 samples with zero mutations.
## Performing pairwise t-test for differences in mutation burden..

We can see that LGG is one of the less mutated cancers.

Now, we can look at the top mutated genes.

oncoplot(maf = maf, top = 10)

IDH1 is by far the most mutated gene - are mutations in that gene in one hotspot or distributed? Let’s visualize the mutations in this gene.

lollipopPlot(maf, "IDH1", labelPos = 'all')
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.

Clearly the R132 position is a hotspot, mutated in more than 3/4 of the samples!

The case is not so clear with TP53.

lollipopPlot(maf, "TP53")
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 8 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##    HGNC    refseq.ID   protein.ID aa.length
## 1: TP53    NM_000546    NP_000537       393
## 2: TP53 NM_001126112 NP_001119584       393
## 3: TP53 NM_001126118 NP_001119590       354
## 4: TP53 NM_001126115 NP_001119587       261
## 5: TP53 NM_001126113 NP_001119585       346
## 6: TP53 NM_001126117 NP_001119589       214
## 7: TP53 NM_001126114 NP_001119586       341
## 8: TP53 NM_001126116 NP_001119588       209
## Using longer transcript NM_000546 for now.

If we would annotate all the mutations on the TP53 lolliplot we wouldn’t be able to read anything, so we need to subset the positions.

top_label <-
  maf@data %>%
  filter(Hugo_Symbol == "TP53") %>%
  group_by(HGVSp_Short) %>%
  summarise(count = n()) %>%
  top_n(5) %>%
  pull(HGVSp_Short) %>%
  str_extract("[0-9]+")
## Selecting by count
lollipopPlot(maf, "TP53", labelPos = top_label, labPosAngle = 20)
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 8 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##    HGNC    refseq.ID   protein.ID aa.length
## 1: TP53    NM_000546    NP_000537       393
## 2: TP53 NM_001126112 NP_001119584       393
## 3: TP53 NM_001126118 NP_001119590       354
## 4: TP53 NM_001126115 NP_001119587       261
## 5: TP53 NM_001126113 NP_001119585       346
## 6: TP53 NM_001126117 NP_001119589       214
## 7: TP53 NM_001126114 NP_001119586       341
## 8: TP53 NM_001126116 NP_001119588       209
## Using longer transcript NM_000546 for now.

Similarly to TP53, ATRX is mutated at different positions rather than in one hotspot. Interestingly, this gene is enriched for Frameshift and Nonsense mutations.

lollipopPlot(maf, "ATRX")
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 2 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##    HGNC refseq.ID protein.ID aa.length
## 1: ATRX NM_138270  NP_612114      2454
## 2: ATRX NM_000489  NP_000480      2492
## Using longer transcript NM_000489 for now.

We can also look at the cooccurences and mutual exclusivity - which genes are frequently mutated together? Which aren’t?

somaticInteractions(maf, top = 15, pvalue = c(0.01, 0.05))

Clinical data

With the recount3 data comes clinical information as well. For example, we can check the sex or ethnicity of the patients with tcga.gdc_cases.demographic. columns.

sample_sheet %>%
  select(starts_with("tcga.gdc_cases.demographic."))

There’s much more data in that dataframe - after all there are almost a thousand columns. However that’s only the data for samples in the expression matrix.

We can also access additional data from TCGAbiolinks, for example we can see in what subtypes were the samples separated into. This data will cover more patients - including those which samples were whole exome sequenced, but have no expression data and so on.

tcga_subtype_data <-
  TCGAquery_subtype(tumor = "lgg")
## lgg subtype information from:doi:10.1016/j.cell.2015.12.028
tcga_subtype_data %>%
  select(ends_with("subtype"))

The dataframe contains much more data, and the function prints out information about the publication the data comes from.

tcga_subtype_data

Now that we have information of a subtype (and other), we can add it to our heatmap.

top_annotation_df <-
  tcga_subtype_data %>%
  select(patient, Histology, Original.Subtype, Transcriptome.Subtype) %>%
  column_to_rownames("patient")

# make sure we are taking only information about samples in RNA-seq expression
top_annotation_df <-
  top_annotation_df[colnames(normalized_counts),]

# prepare colors, cause random ones are bad
# I generate those pallets  https://colorbrewer2.org/
histology_cols <- c(astrocytoma = "#d8b365",  
                    oligodendroglioma ="#99d8c9", 
                    oligoastrocytoma = "#2ca25f")

original_cols <- c("IDHmut-non-codel" = "#fee8c8",
                   "IDHwt" = "#fdbb84",
                   "IDHmut-codel" = "#e34a33")

transcriptome_cols <- c("CL" = "#7b3294", 
                        "ME" = "#c2a5cf", 
                        "NE" = "#a6dba0", 
                        "PN" = "#008837")
# create annotation
ha <-
  HeatmapAnnotation(df = top_annotation_df, 
                    which = "column", 
                    col = list("Histology" = histology_cols,
                               "Original.Subtype" = original_cols,
                               "Transcriptome.Subtype" = transcriptome_cols), 
                    na_col = "#999999")

Heatmap(as.matrix(samples_dist), 
        show_row_names = FALSE, 
        show_column_names = FALSE, 
        col = viridis::mako(100), 
        show_row_dend = FALSE, 
        top_annotation = ha)

And let’s see the transciptome heatmap with added annotation. Interestingly, even by selecting only top 1% we can see that different Transcriptome Subtypes are clustering together.

ht <-
  Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
        show_row_names = FALSE, show_column_names = FALSE,
        clustering_distance_rows = "pearson", name = "gene expression",
        col = viridis::viridis(100), use_raster = TRUE, top_annotation = ha)

ht

What next?

You can do much more with those data - firstly, you can draw inspiration from the vignettes of the packages. Secondly, you can see what is interesting to you, what you would want to know? Thirdly, this is already published data and you can always refer to the paper - maybe there are some figures you have an idea to improve?

Few ideas from the top of my head: * improve the heatmap by adding the subtype, sex and other variable annotations. Add the mutation in various top genes as annotations at the bottom; * reduce dimensionality of expression data with PCA or tSNE and see if you see the subtypes specific from transcriptome data * access the survival data and see how they separate based on the subtype; * visualize point mutations on the protein (for example here)

For example, in order to add IDH1 annotation, we need to join the information from two dataframes (one that has unique column ids and one that has the information about mutation).

idh1_mutation <-
  maf@data %>%
  mutate(patient_id = str_extract(Tumor_Sample_Barcode, 
                                  "[^-]{4}-[^-]{2}-[^-]{4}")) %>%
  select(Hugo_Symbol, patient_id) %>%
  filter(Hugo_Symbol == "IDH1") %>%
  column_to_rownames("patient_id")

ha_bottom <-
  idh1_mutation[colnames(normalized_counts), , drop = FALSE] %>%
  rownames_to_column("patient_id") %>%
  mutate(present = ifelse(is.na(Hugo_Symbol), FALSE, TRUE)) %>%
  group_by(patient_id) %>%
  summarise(IDH1 = any(present)) %>%
  column_to_rownames("patient_id") %>%
  HeatmapAnnotation(df = ., col = list("IDH1" = c(`TRUE` = "black", 
                                                  `FALSE` = "white")))

When we have the annotation, we can add it to the plot.

ht3 <-
  Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
        show_row_names = FALSE, show_column_names = FALSE,
        clustering_distance_rows = "pearson", name = "gene expression",
        col = viridis::viridis(100), 
        top_annotation = ha, bottom_annotation = ha_bottom)

ht3

Interactive plot

With using just one package and one command, we can make an interactive plot.

But before this, I want to translate the ensembl ids to hugo gene symbol to improve readability.

ensembl_ids <- 
  rownames(normalized_counts) %>%
  str_remove("\\.[0-9]*")

mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ensembl_to_hgnc <-
  biomaRt::getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'), 
                 mart = mart) %>%
  filter(ensembl_gene_id %in% ensembl_ids) %>%
  group_by(ensembl_gene_id) %>%
  summarise(hgnc_symbol = paste(unique(hgnc_symbol)[unique(hgnc_symbol) != ""],
                                collapse = ",")) %>%
  mutate(hgnc_symbol = ifelse(hgnc_symbol == "", 
                              ensembl_gene_id,
                              hgnc_symbol)) %>%
  column_to_rownames("ensembl_gene_id") 

rownames(normalized_counts) <-
  ensembl_to_hgnc[ensembl_ids,]

Now that we fixed the gene names, let’s look at the interactive heatmap using InteractiveComplexHeatmap package. Run below code in your notebook.

# to create interactive heatmaps
library(InteractiveComplexHeatmap)

ht_interactive <-
  Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
        show_row_names = TRUE, show_column_names = FALSE,
        clustering_distance_rows = "pearson", name = "gene expression",
        col = viridis::viridis(100), 
        top_annotation = ha, bottom_annotation = ha_bottom)

htShiny(ht_interactive)
## Loading required package: shiny
## 
## Listening on http://127.0.0.1:7767

For more information and beautiful examples check out: Documentation and GitHub.

Joining expression and mutation

# which genes are we intersted in?
genes_of_interest <- c("IDH1", "TP53")

# this will allow us to distinguish between no infomration about mutation
# and no mutation
patient_muts <-
  maf@data %>%
  mutate(patient_id = str_extract(Tumor_Sample_Barcode, 
                                  "[^-]{4}-[^-]{2}-[^-]{4}")) %>%
  pull(patient_id) %>%
  unique()

expression_vals <-
  normalized_counts[genes_of_interest,] %>%
  as.data.frame() %>%
  rownames_to_column("Hugo_Symbol") %>%
  pivot_longer(names_to = "patient_id", 
              values_to = "norm_expression",
              cols = -Hugo_Symbol)

mutation_data <-
  maf@data %>%
  filter(Hugo_Symbol %in% genes_of_interest) %>%
  mutate(patient_id = str_extract(Tumor_Sample_Barcode, 
                                  "[^-]{4}-[^-]{2}-[^-]{4}")) %>%
  dplyr::select(patient_id, Hugo_Symbol, VARIANT_CLASS)

expr_mut_df <-
  full_join(expression_vals, mutation_data) %>%
  # introduce WT for samples with no mutation in the gene
  mutate(VARIANT_CLASS = as.character(VARIANT_CLASS),
         VARIANT_CLASS = ifelse(is.na(VARIANT_CLASS),
                                         yes = ifelse(patient_id %in% patient_muts,
                                                yes = "WT",
                                                no = "NA"), 
                                         no = VARIANT_CLASS),
         VARIANT_CLASS = factor(VARIANT_CLASS, 
                                levels = c("WT", 
                                           "SNV",
                                           "deletion", "insertion",
                                           "NA")))
## Joining, by = c("Hugo_Symbol", "patient_id")
expr_mut_df %>%
  ggplot(aes(VARIANT_CLASS, norm_expression, color = VARIANT_CLASS,
             shape = VARIANT_CLASS)) +
  geom_jitter() +
  geom_boxplot(color = "black", width = 0.3, 
               alpha = 0.7, outlier.shape = NA) +
  facet_wrap(~Hugo_Symbol) +
  theme_classic() +
  theme(legend.position = "none") +
  labs(y = "vst normalized expression",
       x = "Variant classification")
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing missing values (geom_point).

Project

For your project your goal is to describe the selected TCGA cohort:

Create a blogpost with description of your analysis, document your progress. Try to add interactivity to your visualizations.

You will have to present the data to the rest of the hackathon groups at the end of this event. Prepare 10 minutes of showcasing the data, your visualizations.

This is the main project of this hackathon. From now on you will be working with your teams on your own and I will be available to guide you, answer your questions. You are expected to spend few hours each day working on this.

Good luck! I’m already excited to see your data viz!